Method and system using a data-driven model for monocular face tracking
Patent Abstract:
Describes a system and method using a data-driven model for monocular face tracking, in which a single camera is used to track three-dimensional (3D) objects such as faces. The system is versatile. In one method, stereo data is obtained based on an input image sequence, a 3D model is built using the acquired stereo data, and monocular image sequences are tracked using the constructed 3D model. For example, principal component analysis (PCA) can be applied to the stereo data to learn the space of possible face deformations and to build a data-driven 3D model ("3D face model"). The 3D face model can be used to approximate a general shape (e.g., a face pose) as a linear combination of shape basis vectors derived from the PCA.
Publication number: KR20040034606A
Application number: KR10-2003-7014619
Filing date: 2002-05-02
Publication date: 2004-04-28
Inventors: Grzeszczuk, Radek; Bouguet, Jean-Yves; Gokturk, Salih
Applicant: Intel Corporation
Patent Description:
Systems and Methods Using Data-Driven Models for Monocular Face Tracking {METHOD AND SYSTEM USING A DATA-DRIVEN MODEL FOR MONOCULAR FACE TRACKING} [2] Monocular face tracking is the process of statistically estimating face motion, position, and shape based on a monocular image sequence from a fixed camera. Monocular face tracking is an important process in many video processing systems, such as video conferencing systems. For example, estimating facial motion or position in a video conferencing system reduces the amount of facial data or information that must be transmitted or processed: parameters describing the estimated face movement, position, and shape can be transmitted or processed in place of a large amount of raw image data. [3] One kind of face tracking system is the marker-based face tracking system ("marker face tracking system"). In a marker face tracking system, the user must wear colored "markers" at known locations, and the movement of the markers is parameterized to estimate face position and shape. A disadvantage of the marker face tracking system is that it is cumbersome for the user. In particular, the user must place a number of colored markers at varying positions on the face, and must spend time attaching them, which further increases the complexity of using such a system. [4] Another type of face tracking system is the model-based face tracking system, which uses a parameterized face shape model to estimate face position and movement. In conventional model-based face tracking systems, the parameterized model is built by a manual process, for example using a 3D scanner or a Computer-Aided Design (CAD) modeler. A disadvantage of conventional model-based face tracking systems is therefore that the manual construction of face shape models is very ad hoc, leading to a trial-and-error approach to obtaining tracking models. This ad hoc process yields inaccurate and suboptimal models. [1] The present invention relates generally to the field of image processing. In particular, the present invention relates to systems and methods using a data-driven model for monocular face tracking. [11] FIG. 1 illustrates an example of a computing system implementing the present invention. [12] FIG. 2 is a flowchart of an operation for performing monocular tracking using a data-driven model according to an embodiment. [13] FIG. 3 is a diagram illustrating an example of stereo tracking of a stereo input image sequence for building the data-driven model of FIG. 2. [14] FIG. 4 is a diagram illustrating an example of the four-dimensional space of deformations learned from a stereo input sequence. [15] FIG. 5 is a diagram illustrating an example of an input image sequence for monocular tracking. [16] FIG. 6 is a flowchart of an operation for performing the stereo tracking of FIG. 2, according to an exemplary embodiment. [17] FIG. 7 is a flowchart of an operation for calculating the principal shape vectors of FIG. 2, according to one embodiment. [18] FIG. 8 is a flowchart of an operation for performing the monocular tracking of FIG. 2, according to an embodiment. [5] A system and method using a data-driven model for monocular face tracking provide a versatile system for tracking three-dimensional (3D) objects, such as faces, in an image sequence acquired with a single camera. In one embodiment, stereo data based on an input image sequence is obtained. A 3D model is constructed using the acquired stereo data.
Monocular image sequences are then tracked using the constructed 3D model. In one embodiment, principal component analysis (PCA) is applied to the stereo data to learn the space of possible facial deformations and to build a data-driven 3D model ("3D face model"). The 3D face model can be used to approximate a general shape (e.g., a face pose) as a linear combination of shape basis vectors derived from the PCA. [6] By using real stereo data, a small number of shape basis vectors can be computed to build the 3D model, which offers many advantages. For example, an optimally small number of shape basis vectors (e.g., 3 or 4) can span a wide range of facial expressions, such as smiling, talking, and raising eyebrows. In addition, the 3D model may be built using stereo data from one or more users and stored in a database, allowing the face of a new user to be tracked even if no stereo data from that user is stored in the database. [7] Furthermore, by constructing the 3D model from stereo data based on an input image sequence, monocular face tracking of face deformation and pose can be realized without the use of intrusive markers. The 3D face model described herein provides a low-complexity deformable model for simultaneously tracking face deformation and pose from a single image sequence ("monocular tracking"). [8] The following examples describe a system for tracking both the 3D pose and the shape of a facial image ("face") in front of a single video camera without the use of cumbersome markers. Using the data-driven model, the system provides robust monocular tracking. In addition, the system generalizes well, enabling the faces of many people to be tracked with the same 3D model. [9] In the following description, the monocular tracking technique is described in connection with tracking a 3D face image. However, the monocular tracking technique described herein is not intended to be limited to any particular type of image and may be applied to other types of 3D images, such as moving body parts or inanimate objects. [10] The present invention is described by way of illustration and is not limited by the accompanying drawings, in which like reference numerals indicate like elements. [19] Overview [20] An Example of a Computing System [21] FIG. 1 illustrates an example of a computing system 100 for practicing the present invention. The 3D model building techniques and monocular tracking techniques described herein can be implemented by the computing system 100. Computing system 100 may be, for example, a general-purpose computer, workstation, portable computer, hand-held computing device, or other computing device. The components of computing system 100 are illustrative; one or more components may be omitted or added. For example, a plurality of camera devices 128 may be used with the computing system 100. [22] Referring to FIG. 1, the computing system 100 includes a main unit 110 having a central processing unit (CPU) 102 and a coprocessor 103 coupled to the display circuit 105, the main memory 104, the static memory 106, and the flash memory 107 via a bus 101. The main unit 110 of the computing system 100 is also connected via the bus 101 to the display 121, keypad input 122, cursor control 123, hard copy device 124, input/output (I/O) devices 125, mass storage device 126, and camera device 128.
[23] Bus 101 is a standard system bus for communicating information and signals. CPU 102 and coprocessor 103 are the processing units of computing system 100. CPU 102 or coprocessor 103, or both, may be used to process information and/or signals of computing system 100. The CPU 102 may be used to process code or instructions that execute the 3D data-driven model building techniques and monocular tracking techniques described herein. Alternatively, coprocessor 103 may process code or instructions to execute the same techniques as CPU 102. The CPU 102 includes a control unit 131, an arithmetic logic unit (ALU) 132, and several registers 133, which may be used by the CPU 102 for data and information processing purposes. Coprocessor 103 may include components similar to those of CPU 102. [24] Main memory 104 may be, for example, random access memory (RAM) or another dynamic storage device that stores data, code, or instructions to be used by computing system 100. In one embodiment, main memory 104 may store data associated with an input stereo image sequence and/or a 3D data-driven model, as described in more detail below. Main memory 104 may also temporarily store variables or other intermediate data while code or instructions are being executed by CPU 102 or coprocessor 103. Static memory 106 may be, for example, read-only memory (ROM) and/or another static storage device, and may store data and/or code or instructions to be used by computing system 100. The flash memory 107 is a memory device that can be used to store basic input/output system (BIOS) code or instructions. [25] The display 121 may be, for example, a cathode ray tube (CRT) or a liquid crystal display (LCD). The display 121 may display images, information, or graphics to a user. The main unit 110 of the computing system 100 may interface with the display 121 through the display circuit 105. The keypad input 122 is an alphanumeric input device for communicating information and command selections to the computing system 100. The cursor control 123 may be, for example, a mouse, a touch pad, a trackball, or cursor direction keys for controlling the movement of an object on the display 121. The hard copy device 124 may be, for example, a laser printer for printing information on a medium such as paper or film. Any number of input/output (I/O) devices 125 may be connected to the computing system 100; for example, an I/O device such as a speaker may be connected. The mass storage device 126 may be, for example, a hard disk or a read/writable CD or DVD drive. Camera device 128 may be a video image capture device and may be used in the image processing techniques described herein. In one embodiment, camera device 128 includes a Digiclops™ camera system that provides 640 × 480 color images at an average frame rate of 4 fps. [26] In one embodiment, the 3D data-driven model building techniques and monocular tracking techniques described herein may be implemented by hardware and/or software modules included within computing system 100. For example, CPU 102 or coprocessor 103 may execute code or instructions, stored in a machine-readable medium such as main memory 104 or static memory 106, to process a stereo input sequence for building a 3D data-driven model as described herein.
Further, the CPU 102 or coprocessor 103 may execute code or instructions for tracking a monocular input image sequence using the 3D data-driven model as described herein. Such code or instructions may also be stored in the memory devices of the main unit 110. [27] A machine-readable medium may include any mechanism for providing (i.e., storing and/or transmitting) information in a form readable by a machine, such as a computer or digital processing device. For example, machine-readable media may include ROM, RAM, magnetic disk storage media, optical storage media, flash memory devices, and other memory devices. Code or instructions may also be represented by carrier signals, infrared signals, digital signals, and other signals. A machine-readable medium can also be used to store the database for the 3D data-driven model described herein, and one or more machine-readable media may be used to store the 3D model. [28] Basic Operation [29] FIG. 2 illustrates a functional flow diagram of an operation 200 for performing monocular tracking using a data-driven model according to one embodiment. Referring to FIG. 2, operation 200 includes two stages. The first stage is referred to as operation block 210, or the learning stage 210. In the learning stage 210, PCA is applied to real stereo tracking data to learn the space of possible face deformations and to construct a 3D data-driven model for monocular tracking. The 3D data-driven model can be used to approximate a general shape as a linear combination of shape basis vectors. The second stage is operation block 220, in which monocular tracking is performed using the 3D data-driven model constructed in the learning stage. By using the 3D data-driven model, the deformation and pose of an image such as a face can be tracked together from a monocular, or single, image sequence. Initially, operation 200 begins with the learning stage 210. [30] At operation block 202 in the learning stage 210, a stereo sequence is input. For example, the camera device 128 may include a first camera and a second camera to acquire an image sequence from a left perspective and a right perspective, as shown in FIG. 3. As illustrated in FIG. 3, the first and second cameras may acquire an image sequence (e.g., frames 1 to 100) of a person exhibiting changing facial movements and poses from the left and right views. The stereo input sequence may be input to computing system 100 for processing. [31] At operation block 204, the input stereo sequence is tracked. In particular, a low-complexity face mesh (e.g., 19 points at varying positions on the face, as shown in FIG. 3) is initialized and then tracked using a standard optical flow technique. To handle the non-rigid deformation of the face, each point is tracked independently to obtain a facial shape trajectory. [32] At operation block 206, PCA processing is initiated on the shape trajectories obtained from the tracked input stereo sequence. PCA is a mathematical procedure that optimally computes low-dimensional representations of data contained in a high-dimensional space. The PCA processing yields the principal shape vectors of a compact deformable 3D shape model ("3D shape model"), which is used in monocular tracking. [33] In operation block 208, the principal shape vectors are calculated, as described in more detail below. Once the principal shape vectors are calculated, any facial movement or pose during monocular tracking can be approximated by a linear combination of the principal shape vectors.
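As a rough illustration of the independent point tracking of operation block 204, the following is a minimal sketch using OpenCV's pyramidal Lucas-Kanade optical flow as a stand-in for the "standard optical flow technique" mentioned above; the names (`track_mesh`, `frames`, `pts0`) are illustrative and not from the patent.

```python
# Minimal sketch of operation block 204: independently tracking N mesh points
# from frame to frame with pyramidal Lucas-Kanade optical flow (assumed
# stand-in; the patent does not name a specific implementation).
# `frames` is a list of grayscale images; `pts0` the (N, 2) array of
# manually initialized mesh points.
import numpy as np
import cv2

def track_mesh(frames, pts0):
    """Track pts0 through frames; returns the per-frame point trajectory."""
    pts = pts0.astype(np.float32).reshape(-1, 1, 2)
    trajectory = [pts.reshape(-1, 2).copy()]
    for prev, curr in zip(frames[:-1], frames[1:]):
        pts, status, _err = cv2.calcOpticalFlowPyrLK(
            prev, curr, pts, None, winSize=(11, 11), maxLevel=3)
        trajectory.append(pts.reshape(-1, 2).copy())
    return np.stack(trajectory)  # shape: (num_frames, N, 2)
```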
[34] In operation block 220 (the second stage), monocular tracking may be performed on a monocular input sequence using the computed model. A monocular sequence is a sequence of images from a single camera. For example, in each frame of the monocular input sequence (e.g., frames 1 to 72) shown in FIG. 5, the face shape can be approximated by a linear combination of the principal shape vectors of the model built in the learning stage 210. In particular, while a person changes facial expressions and poses, the optical flow information of the sequence can be used together with the computed model to track the changes in pose and expression. [35] The operation can be implemented within the exemplary computing system 100. For example, the CPU 102 may execute code or instructions to build the 3D model and perform the PCA processing, described in more detail below. The data-driven 3D model may be stored in a memory storage of the computing system 100. In one embodiment, the data-driven 3D model is a "deformable face model", described next. [36] Deformable Face Model [37] The following description presents the parameterization needed to generate a deformable face model from stereo tracking data and to track the deformable face model monocularly. For example, referring to FIG. 5, a monocular face sequence can be tracked in 3D space using the deformable face model described herein. [38] First, let $I_n$ be the $n$-th image of a monocular face sequence, such as the 72-frame sequence shown in FIG. 5. The 3D structure of the face in the frame at time $n$ can be represented as a set of $N$ points $P_i$, $i = 1, \dots, N$. To perform monocular tracking, coordinate vectors in a face reference frame and in a camera reference frame must be defined. In particular, let $X_i(n)$ and $X_i^c(n)$ be the coordinate vectors of point $P_i$ in the face frame and the camera frame, respectively. [39] The vectors $X_i(n)$ and $X_i^c(n)$ are then related to each other through a rigid-body transformation that characterizes the pose of the user's face with respect to the camera at time $n$: [40] $X_i^c(n) = R_n X_i(n) + t_n$ [41] where $R_n$ is a 3×3 rotation matrix and $t_n$ is a translation vector. To track the face in each frame, the quantity $X_i(n)$ for the shape, as a non-rigid object, and $R_n$ and $t_n$ for the pose must be estimated, as suggested in FIG. 5. Since $R_n$ is a rotation matrix, it is uniquely parameterized by a 3-vector $\omega_n$ known as the rotation vector. The rotation matrix and the rotation vector can be related to one another using standard formulas. [42] The data in the images $I_n$, $n = 1, 2, \dots, M$ (e.g., frames 1 to 72) can be used to estimate the shape and pose of the face in each frame. In particular, let $x_i(n)$ be the image coordinate vector of the projection of $P_i$ on image $I_n$. Then, in one embodiment, the conventional pinhole camera model can be used to determine the image coordinate vector of the projection of $P_i$: [43] $x_i(n) = \pi\left(X_i^c(n)\right)$, where $\pi\left([X\ Y\ Z]^T\right) = \frac{1}{Z}[X\ Y]^T$. [44] Monocular tracking of the pose and 3D structure $X_i(n)$ is then equivalent to inverting the projection map $\pi$ to recover the shape and pose from the images.
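The rigid-body transform and pinhole projection above can be sketched as follows, assuming SciPy for the rotation-vector (Rodrigues) parameterization; the function names are illustrative, not from the patent.

```python
# Sketch of paragraphs [39]-[43]: map face-frame points into the camera
# frame via X_c = R X + t, with R parameterized by a rotation 3-vector,
# then project with a calibrated pinhole camera.
import numpy as np
from scipy.spatial.transform import Rotation

def face_to_camera(X, omega, t):
    """X: (N, 3) face-frame points; omega: rotation 3-vector; t: (3,) translation."""
    R = Rotation.from_rotvec(omega).as_matrix()  # 3x3 rotation matrix
    return X @ R.T + t                           # R X_i + t for each point

def pinhole(Xc):
    """Project (N, 3) camera-frame points: pi([X Y Z]) = [X/Z, Y/Z]."""
    return Xc[:, :2] / Xc[:, 2:3]
```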
[45] In one embodiment, to perform monocular tracking of non-rigid shapes (e.g., changing facial expressions and poses), the non-rigid shapes are modeled as a linear combination of rigid shapes. Modeling non-rigid shapes as linear combinations of rigid shapes avoids processing image projection points for an infinite number of changing shapes and poses. Thus, at any time $n$ in the sequence, the shape coordinate vector $X_i(n)$ can be written as the sum of a mean shape vector $\bar{X}_i$ and a linear combination of a handful of known shape vectors $U_i^k$, the principal shape basis vectors, as shown in Equation 1 below. [46] [Equation 1] [47] $X_i(n) = \bar{X}_i + \sum_{k=1}^{p} \alpha_k(n)\, U_i^k$ [48] In Equation 1, $p \ll 3N$, and the $p$ coefficients $\alpha_k(n)$ represent the quantities that allow for the non-rigidity of the 3D shape. If $p = 0$, the face shape $X_i(n)$ reduces to the rigid mean shape $\bar{X}_i$. Thus $p$ is called the "dimensionality of the deformation space". The image projection map can then be reduced to a function of the pose parameters $\omega_n$ and $t_n$ and the deformation vector $\alpha_n = [\alpha_1(n)\ \cdots\ \alpha_p(n)]^T$ of $p$ deformation coefficients. The image projection map can thus be written as Equation 2 below. [49] [Equation 2] [50] $x_i(n) = \pi\!\left(R_n\!\left(\bar{X}_i + \sum_{k=1}^{p} \alpha_k(n)\, U_i^k\right) + t_n\right) = \pi_i\!\left(\alpha_n, \omega_n, t_n\right)$ [51] The monocular tracking procedure can then be performed by combining optical flow constraints (such as Lucas-Kanade) with this specific form of deformable model, simultaneously estimating the deformation vector $\alpha_n$ and the pose parameters $\omega_n$ and $t_n$ in every frame, as represented by Equations 1 and 2. The monocular tracking procedure is described in more detail below. [52] Before the monocular tracking procedure is performed, the principal shape basis vectors $U_i^k$ of Equation 1 must be computed, which is done in the learning stage 210 shown in FIG. 2. By using the principal shape basis vectors, the data-driven model avoids manual construction of a non-rigid model. The principal shape basis vectors are generated from real 3D tracked data, which is also obtained in the learning stage 210 shown in FIG. 2. In particular, calibrated stereo cameras are used to track changing facial expressions and poses in 3D. For example, a short stereo input sequence of approximately 100 to 150 frames (e.g., as shown in FIG. 3) can be used. [53] The principal shape basis vectors $U_i^k$ can thus be calculated from the sequence tracked at operation blocks 202 and 204 using PCA processing. The processing of operation blocks 202 and 204 provides the stereo tracking necessary to obtain the 3D trajectory data used for shape deformation analysis.
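A minimal sketch of Equations 1 and 2 follows, reusing the `face_to_camera` and `pinhole` helpers from the previous sketch; stacking the basis vectors `U` as a (p, N, 3) array is an assumed convention, not the patent's.

```python
# Sketch of Equations 1 and 2: deform the mean shape with p coefficients,
# then apply the pose and project. All names are illustrative.
import numpy as np

def deform(X_bar, U, alpha):
    """Equation 1: X(n) = X_bar + sum_k alpha_k U^k, with U of shape (p, N, 3)."""
    return X_bar + np.tensordot(alpha, U, axes=1)

def project_model(alpha, omega, t, X_bar, U):
    """Equation 2: x_i(n) = pi(R_n (X_bar_i + sum_k alpha_k U_i^k) + t_n)."""
    # face_to_camera and pinhole are the helpers sketched earlier.
    return pinhole(face_to_camera(deform(X_bar, U, alpha), omega, t))
```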
[54] Stereo Tracking [55] FIG. 6 illustrates a flow diagram of operation 204 of FIG. 2 for performing stereo tracking according to one embodiment. Initially, operation 204 begins at operation block 604. [56] In operation block 604, the set of points for the left and right camera images is initialized. In one embodiment, a set of $N = 19$ points $P_i$, located on the face (two on the eyes, three on the nose, eight on the mouth, and six on the eyebrows), is initialized on the left and right camera images as shown in FIG. 3. During this operation, the user keeps the head pose as fixed as possible throughout the sequence while exhibiting various facial expressions, such as opening and closing the mouth, smiling, and raising the eyebrows, so that the changing face deformation is provided independently of pose. In one embodiment, the set of points is marked on the first left and right camera images by the user of computing system 100. The stereo image sequence can then be tracked using these points. [57] Note that not all of the points need lie within textured areas of the image. Texture is a requirement for independent feature tracking, which explicitly selects features that are "good to track", but not for model-based tracking. For example, the tip of the nose lies entirely within a textureless region, and the mouth contour points and the points on the eyebrows are edge features. It would be impossible to track all of these points individually using conventional optical flow techniques. [58] At operation block 604, the set of points is then tracked using stereo triangulation. Stereo tracking updates the location of each point (in the left camera reference frame) so that its current left and right image projections approximately match the previous image projections (i.e., temporal tracking). [59] Image Registration Cost [60] In one embodiment, to maintain stereo correspondence throughout stereo tracking, the left and right image projections are matched approximately by considering a matching cost function between the left and right images. In particular, stereo tracking of a point $P_i$ from frame $n-1$ to frame $n$ can be performed by minimizing the cost function $E_i$ of Equation 3 below. [61] [Equation 3] [62] $E_i(n) = \gamma_1 \sum_{\mathrm{ROI}} \left( I_n^L(x_i^L(n)) - I_{n-1}^L(x_i^L(n-1)) \right)^2 + \gamma_2 \sum_{\mathrm{ROI}} \left( I_n^R(x_i^R(n)) - I_{n-1}^R(x_i^R(n-1)) \right)^2 + \gamma_3 \sum_{\mathrm{ROI}} \left( I_n^L(x_i^L(n)) - I_n^R(x_i^R(n)) \right)^2$ [63] In Equation 3, $I_n^L$ and $I_n^R$ are the left and right images at time $n$, and $x_i^L(n)$ and $x_i^R(n)$ are the coordinate vectors of the left and right image projections of $P_i$. The sums in $E_i$ are taken over a window around each image point, called the region of interest (ROI). The first and second terms of Equation 3 are the conventional image matching cost terms for independent left and right temporal tracking. The third term maintains the correspondence between the left and right images. The three coefficients $\gamma_1$, $\gamma_2$, and $\gamma_3$ weighting the three terms are fixed weighting coefficients (i.e., equal for all points) set by the user according to the relative reliability of the three terms. [64] Weighting Coefficient Calculation [65] In one embodiment, the value of the stereo coefficient $\gamma_3$ is kept smaller than the temporal coefficients $\gamma_1$ and $\gamma_2$; the ratio $\gamma_1 / \gamma_3$ is typically kept at about 20. The coefficient values can also be hardcoded separately for each of the 19 points on the face mesh shown in FIG. 3. The values of $\gamma_1$, $\gamma_2$, and $\gamma_3$ may be 1, 1, and 0.05, respectively, for an average ROI area of approximately 100 pixels. [66] Minimizing the Energy Function [67] When applied to all mesh points, the weighted cost terms yield a global energy function $\sum_{i=1}^{N} E_i(n)$ that can be minimized. With this form of total energy function, stereo tracking works well for short sequences (e.g., up to 20 to 30 frames). For longer stereo sequences, regularization terms may be added to the cost function so that the entire 3D structure deforms smoothly as a whole through the stereo sequence and maintains its integrity. The total energy cost $E(n)$ is then: $E(n) = \sum_{i=1}^{N} E_i(n) + E_T(n) + E_S(n) + E_A(n)$ [68] The term $E_T(n)$ is a temporal smoothing term used to minimize the magnitude of the 3D velocity of each point. The term $E_S(n)$ is a shape smoothing term used to minimize the velocity difference between neighboring points; it preserves the integrity of the model by weakly constraining neighboring points to move together. The term $E_A(n)$ is an anthropometric energy cost term used to keep the segment lengths as close as possible to their values computed in the first frame and to prevent drift over long tracking sequences. These three regularization terms may be expressed, for example, in the following manner: [69] $E_T(n) = \sum_{i=1}^{N} \beta_i \left\| X_i(n) - X_i(n-1) \right\|^2$, $\quad E_S(n) = \sum_{(i,j)} \delta_{ij} \left\| \left(X_i(n) - X_j(n)\right) - \left(X_i(n-1) - X_j(n-1)\right) \right\|^2$, $\quad E_A(n) = \sum_{(i,j)} \eta_{ij} \left( \left\| X_i(n) - X_j(n) \right\| - L_{ij} \right)^2$, where the sums in $E_S$ and $E_A$ run over pairs of neighboring points (mesh segments) and $L_{ij}$ are the segment lengths computed in the first frame. [70] Here, the positive coefficients $\beta_i$, $\delta_{ij}$, and $\eta_{ij}$ vary from point to point and from edge to edge. In one embodiment, segments that stretch a great deal are assigned smaller $\delta_{ij}$ and $\eta_{ij}$ values. Similarly, a point $P_i$ on a highly deformable area of the face is assigned a small $\beta_i$. Points and segments known to be fairly rigid are assigned larger values. [71] Conversely, to permit the large motions and stretches applied to them, points and edges in highly deformable regions are assigned lower values. For example, the points and edges on the mouth contour have smaller coefficients than the points and edges belonging to the nose and eyes. In one embodiment, the values of $\beta_i$, $\delta_{ij}$, and $\eta_{ij}$ are 20000, 20000, and 100 for an average image feature patch area of approximately 100 pixels.
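The per-point registration cost of Equation 3 might be sketched as follows, with the coefficient values 1, 1, and 0.05 suggested above; the `patch` helper and its window size are illustrative assumptions.

```python
# Sketch of the per-point stereo registration cost of Equation 3, with
# three weighting coefficients (here gL, gR, gS). All names are illustrative.
import numpy as np

def patch(I, x, r=5):
    """Hypothetical helper: (2r+1)^2 pixel ROI of image I centered at x."""
    cx, cy = np.round(np.asarray(x)).astype(int)
    return I[cy - r:cy + r + 1, cx - r:cx + r + 1].astype(np.float64)

def stereo_cost(IL_n, IR_n, IL_p, IR_p, xL_n, xR_n, xL_p, xR_p,
                gL=1.0, gR=1.0, gS=0.05):
    eL = patch(IL_n, xL_n) - patch(IL_p, xL_p)   # left temporal term
    eR = patch(IR_n, xR_n) - patch(IR_p, xR_p)   # right temporal term
    eS = patch(IL_n, xL_n) - patch(IR_n, xR_n)   # left-right stereo term
    return gL * np.sum(eL**2) + gR * np.sum(eR**2) + gS * np.sum(eS**2)
```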
[72] 3D Shape Trajectories [73] The solution shape that minimizes the total energy function $E(n)$ can be calculated using gradient descent. That is, writing $X_i(n) = X_i(n-1) + dX_i$, the derivative of $E(n)$ with respect to each differential shape coordinate vector $dX_i$ is set to zero. After derivation of the Jacobian matrix, the solution for the shape satisfies a linear system $D\,d\mathbf{X} = \mathbf{e}$, where $d\mathbf{X}$ is the 3N×1 column vector obtained by stacking the $N$ vectors $dX_i$, and $D$ and $\mathbf{e}$ are a 3N×3N matrix and a 3N×1 vector, respectively. Once $d\mathbf{X}$ is computed, the shape $X(n)$ is known. The same process is repeated over the entire stereo sequence to finally obtain a complete 3D shape trajectory. [74] Calculating the Principal Shape Vectors [75] FIG. 7 illustrates a flow diagram for operation 208 of FIG. 2 for calculating the principal shape vectors, according to one embodiment. Initially, operation 208 begins at operation block 702. [76] In operation block 702, the mean shape $\bar{X}$ is calculated. In particular, the result of stereo tracking is the 3D trajectory $X(n)$ of each point $P_i$ in the left camera reference frame, for $n = 1, \dots, M$, where $M$ is the number of frames in the sequence. The shape basis vectors are calculated using singular value decomposition (SVD). First, the mean shape $\bar{X}$ is calculated as follows: [77] $\bar{X} = \frac{1}{M} \sum_{n=1}^{M} X(n)$ [78] In operation block 704, the mean shape $\bar{X}$ is subtracted from the entire trajectory; that is, $\tilde{X}(n) = X(n) - \bar{X}$. The resulting shape trajectory is then stored as a 3N×M matrix $\mathcal{M}$: [79] $\mathcal{M} = \left[ \tilde{X}(1)\ \tilde{X}(2)\ \cdots\ \tilde{X}(M) \right]$ [80] At operation block 706, the SVD is applied to $\mathcal{M}$. In particular, applying the SVD to $\mathcal{M}$ yields $\mathcal{M} = U \Sigma V^T$, where $U = [u_1\ u_2\ \cdots\ u_{3N}]$ and $V = [v_1\ v_2\ \cdots\ v_M]$ are two unitary 3N×3N and M×M matrices, and $\Sigma$ is a diagonal matrix of positive, monotonically non-increasing singular values $\sigma_1 \geq \sigma_2 \geq \cdots$. After this decomposition, $\mathcal{M}$ may be written as [81] $\mathcal{M} = \sum_{k=1}^{3N} \sigma_k\, u_k\, v_k^T$ [82] In operation block 708, the sum for $\mathcal{M}$ is truncated from 3N to $p$ terms, which yields the optimal least-squares approximation of the matrix $\mathcal{M}$ for a fixed budget of $p$ vectors. This is equivalent to approximating each 3D shape in the sequence by its orthogonal projection onto the linear subspace spanned by the first $p$ vectors $u_1, \dots, u_p$. These vectors are exactly the $p$ principal deformation shape vectors: [83] for $k = 1, \dots, p$, [84] $U^k = u_k$ [85] The resulting model of principal shape vectors is then ready for the monocular tracking stage. For example, if a user makes various facial expressions, those expressions can be tracked to the extent that similar expressions were exposed to the system during the learning stage 210. Since the vectors $u_k$ are unitary, the shape coefficients $\alpha_k$ appearing in Equations 1 and 2 are in the same units as the mean shape $\bar{X}$; in one embodiment, the units are centimeters, and four principal shape vectors are used to cover the most common facial expressions (e.g., mouth and eyebrow movements). However, the number of principal shape vectors used may be changed based on the variety of facial expressions to be tracked. [86] Referring back to FIG. 4, the four-dimensional space of deformations 411-414 computed from the stereo sequence of FIG. 3 is shown. As shown in FIG. 4, the principal shape vectors may correspond to a combination of four major facial movements, for example a smile, a closed mouth, and raised left and right eyebrows.
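A minimal sketch of operation 208 (FIG. 7) follows: computing the mean shape, subtracting it, applying the SVD, and truncating to the first $p$ principal shape vectors. The column-stacking convention for `X` is an assumption.

```python
# Sketch of operation blocks 702-708: mean shape, mean subtraction, SVD,
# and truncation. `X` stacks the tracked 3D shapes as a (3N, M) matrix,
# one column per frame (assumed layout).
import numpy as np

def principal_shape_vectors(X, p=4):
    X_bar = X.mean(axis=1, keepdims=True)        # mean shape, (3N, 1)
    M = X - X_bar                                # deviation matrix, 3N x M
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return X_bar.ravel(), U[:, :p], sigma[:p]    # first p unit basis vectors
```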
[87] Monocular Tracking [88] FIG. 8 illustrates a flow diagram of operation 220 of FIG. 2 for performing monocular tracking using the model computed in the learning stage 210, according to one embodiment. Initially, operation 220 begins at operation block 802 with an image sequence such as the one shown in FIG. 5. [89] At operation block 802, the parameters for shape and pose are estimated from the image sequence using image measurements. In pure optical flow tracking, the translational displacement of every point between two consecutive frames (e.g., frames 1 and 2) is computed, and each image point is processed independently. Here, in model-based tracking, all points in the model are interconnected through the parameterized 3D model given by Equation 1. Thus, the parameters defining the configuration of the 3D model are estimated all at once from the image measurements. These parameters include the deformation vector $\alpha_n$ and the pose parameters $\omega_n$ and $t_n$. [90] In operation block 804, the optimal shape and pose are obtained by finding the configuration of the face model that best fits the next frame. For example, suppose the face model has been tracked from the first frame $I_1$ of the sequence up to the $(n-1)$-th frame $I_{n-1}$. The goal of monocular tracking is then to obtain the optimal deformation $\alpha_n$ and pose $(\omega_n, t_n)$ of the face model that best fits the next frame $I_n$. The following description explains how to find the optimal pose and deformation for monocular tracking. [91] To obtain the optimal pose and deformation, the cost function $C_n$ given by Equations 4 and 5 is minimized. [92] [Equations 4 and 5] $C_n(s) = \sum_{i=1}^{N} \sum_{\mathrm{ROI}} \left[ \left( I_n\!\left(\pi_i(\Theta_{n-1} + s)\right) - I_{n-1}\!\left(\pi_i(\Theta_{n-1})\right) \right)^2 + \varepsilon \left( I_n\!\left(\pi_i(\Theta_{n-1} + s)\right) - I_1\!\left(\pi_i(\Theta_1)\right) \right)^2 \right]$ where $\Theta_n = (\alpha_n, \omega_n, t_n)$ collects the shape and pose parameters and $s$ is the update applied to them. [93] Here, $\pi_i$ is the model-based image projection map defined in Equation 2. The sum in Equation 4 runs over all the image points, each through a small pixel window (the ROI) centered at $x_i(n-1)$ and $x_i(1)$. [94] In one embodiment, the first term of Equation 4 is the standard matching cost term; that is, it measures the image mismatch between two consecutive images at the model points. The second term measures the image mismatch between the current image $I_n$ and the first image $I_1$. This additional term weakly constrains every facial feature to keep, in a neighborhood of its image projection, the appearance it had at the beginning of the sequence. Doing so prevents tracking drift and increases robustness; this term is called the drift monitoring energy term. [95] The two energy terms are weighted relative to one another by the scalar variable $\varepsilon$. In one embodiment, $\varepsilon = 0.2$, emphasizing the tracking cost over the monitoring cost. Tracking then amounts to estimating the optimal pose and deformation update vector $s = (d\alpha, d\omega, dt)$, which is realized by setting the derivative of $C_n$ with respect to $s$ to zero: [96] [Equation 6] [97] $\frac{\partial C_n}{\partial s} = 0$ [98] Equation 6 is solved for $s$ under the assumption of small motion between two consecutive frames. Let $I_{t_i}$ be the extended time derivative defined as: [99] [Equation 7] [100] $I_{t_i} = \begin{cases} I_n - I_{n-1}, & \varepsilon = 0 \\ I_n - \dfrac{I_{n-1} + \varepsilon\, I_1}{1 + \varepsilon}, & \varepsilon > 0 \end{cases}$ [101] The extended time derivative $I_{t_i}$ is evaluated over the neighborhood of the point $x_i(n-1)$. If $\varepsilon = 0$, Equation 7 reduces to the actual time difference between consecutive frames. If $\varepsilon > 0$, the image patch of the previous image $I_{n-1}$ is averaged with the image patch of the first frame (the second row of Equation 7), and the resulting patch is used as the reference for the next image $I_n$. This process effectively lets the monocular tracker "remember" the original appearance of each feature as it was selected in the first image, improving robustness and reducing drift.
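One plausible reading of the drift-monitoring blend behind Equation 7 is sketched below, reusing the `patch` helper from the stereo-cost sketch above; the exact normalization by $1 + \varepsilon$ is an assumption drawn from the verbal description.

```python
# Sketch of the Equation 7 blend: for eps > 0 the reference patch is a
# weighted average of the previous-frame patch and the first-frame patch,
# so the tracker "remembers" each feature's original appearance.
# `patch` is the hypothetical ROI sampler defined in the earlier sketch.
import numpy as np

def extended_time_residual(I_n, I_prev, I_first, x_n, x_prev, x_first, eps=0.2):
    ref = (patch(I_prev, x_prev) + eps * patch(I_first, x_first)) / (1.0 + eps)
    return patch(I_n, x_n) - ref  # reduces to I_n - I_{n-1} when eps = 0
```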
[102] Next, let $I_{x_i}$ be the x and y image derivatives (the image gradient) of the image $I_n$ in the neighborhood of $x_i(n-1)$: [103] $I_{x_i} = \left[ \frac{\partial I_n}{\partial x}\ \ \frac{\partial I_n}{\partial y} \right]$ [104] Also let $x_{s_i}$ be the derivative of the image projection $\pi_i$ with respect to $s$, evaluated at $s = 0$, in the neighborhood of $x_i(n-1)$: [105] $x_{s_i} = \left. \frac{\partial\, \pi_i\left(\Theta_{n-1} + s\right)}{\partial s} \right|_{s=0}$ [106] Since $I_{x_i}$ and $x_{s_i}$ are of size 1×2 and 2×(p+6) respectively, the resulting matrix $I_{x_i} x_{s_i}$ is of size 1×(p+6). The optimal shape and pose update vector $s$ satisfying Equation 6 is then: [107] [Equation 8] [108] $s = G^{-1} b$ [109] where the (p+6)×(p+6) matrix $G$ and the (p+6)×1 vector $b$ are given by: [110] $G = \sum_{i=1}^{N} \sum_{\mathrm{ROI}} \left( I_{x_i} x_{s_i} \right)^T \left( I_{x_i} x_{s_i} \right)$ [111] $b = -\sum_{i=1}^{N} \sum_{\mathrm{ROI}} I_{t_i} \left( I_{x_i} x_{s_i} \right)^T$ [112] Here, the unique tracking solution $s$ is computed all at once for the entire model, in contrast with the original optical flow formulation in which each image point is processed separately. The 3D model, constructed from real data and parameterized by a few coefficients, is thus used directly for tracking. For $s$ to be computable, the matrix $G$ must be of full rank p+6. Roughly, each point in the 3D model contributes 0, 1, or 2 scalar observation constraints depending on whether it lies in a textureless region, an edge region, or a fully textured region of the image. In one embodiment, for tracking to be well posed, the total number of constraints collected over all points must be greater than or equal to p + 6 = 10. [113] Once $s$ is calculated, the pose and deformation at time frame $n$ are known. In one embodiment, the same procedure may be repeated several times (e.g., 4 or 5 times) within a given time frame $n$ to refine the estimate. The entire process is then repeated for subsequent frames. In one embodiment, the 3D model parameters are initialized manually by first localizing the $N = 19$ facial features on the first image $I_1$. A small optimization is then performed to compute the initial pose and deformation parameters that make the image projection of the model match the manually selected points. [114] Note that the region of interest (ROI) of each model point does not remain constant throughout the sequence. Instead, the size and geometry of the ROI are recomputed in every frame based on the distance (depth) and orientation of the point in space (the local surface normal). The resulting regions of interest are the small parallelograms shown in FIG. 5. In particular, points on the face that face away from the camera are declared "non-visible"; their regions of interest are assigned a size of zero, and they therefore do not contribute to the tracking update.
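The Equation 8 update can be sketched as a single linear solve, assuming the 1×(p+6) rows $I_{x_i} x_{s_i}$ and the extended time derivatives have already been stacked over all ROI pixels of all points; the names are illustrative.

```python
# Sketch of the Equation 8 update: accumulate the (p+6)x(p+6) matrix G and
# the (p+6)x1 vector b, then solve for the pose-and-deformation update s.
# `g_rows` stacks the 1x(p+6) products I_xi * x_si described in the text;
# `It_vals` the corresponding extended time derivatives (assumed inputs).
import numpy as np

def solve_update(g_rows, It_vals):
    """g_rows: (K, p+6) stacked gradient rows; It_vals: (K,) residuals."""
    G = g_rows.T @ g_rows          # (p+6, p+6); must have full rank p+6
    b = -g_rows.T @ It_vals        # (p+6,)
    return np.linalg.solve(G, b)   # update vector s
```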
[115] Thus, a two-stage system and method have been described for 3D tracking of pose and deformation, for example of human faces in monocular image sequences, without the use of cumbersome special markers. The first stage of the system applies PCA to real stereo tracking data to learn the space of all possible face deformations. The resulting model approximates any general shape as a linear combination of shape basis vectors. The second stage of the system uses this low-complexity deformable model to simultaneously estimate the pose and deformation of the face from a single image sequence. This stage is known as model-based monocular tracking. [116] The data-driven approach to model building is well suited to 3D tracking of non-rigid objects and provides a precise and practical alternative to the manual construction of models using 3D scanners or CAD modelers. In addition, generating the model from real data allows a great variety of facial deformations to be tracked with fewer parameters than a hand-built model, resulting in improved robustness and tracking accuracy. The system also generalizes in a very promising way, enabling multiple people to be tracked with the same 3D model, which is a significant improvement over most other face tracking systems, which require a different model for each user. [117] In the foregoing specification, the invention has been described with reference to specific embodiments. It will be evident, however, that various modifications and changes can be made without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims:
Claims (15) 1. An image processing method comprising: acquiring stereo data based on an input image sequence; building a three-dimensional (3D) model using the acquired stereo data; and tracking a monocular image sequence using the constructed 3D model. 2. The method of claim 1, wherein acquiring stereo data comprises acquiring stereo data based on an input image sequence of changing facial expressions. 3. The method of claim 1, wherein building the 3D model comprises processing the acquired stereo data using principal component analysis (PCA). 4. The method of claim 3, wherein the stereo data processed using the PCA allows the 3D model to approximate a general shape as a linear combination of shape basis vectors. 5. The method of claim 1, wherein tracking the monocular image sequence comprises tracking a monocular image sequence of facial deformations using the constructed 3D model. 6. A computing system comprising: an input device to acquire stereo data based on an input image sequence; and a processing device to build a three-dimensional (3D) model using the acquired stereo data and to track a monocular image sequence using the constructed 3D model. 7. The computing system of claim 6, wherein the input device acquires the stereo data based on an input image sequence of changing facial expressions. 8. The computing system of claim 6, wherein the processing device processes the acquired stereo data using PCA. 9. The computing system of claim 6, wherein the processing device approximates a general shape as a linear combination of shape basis vectors based on the PCA-processed stereo data. 10. The computing system of claim 6, wherein the processing device tracks a monocular image sequence of facial deformations using the constructed 3D model. 11. A machine-readable medium providing instructions that, when executed by a processor, cause the processor to perform operations comprising: acquiring stereo data based on an input image sequence; building a three-dimensional (3D) model using the acquired stereo data; and tracking a monocular image sequence using the constructed 3D model. 12. The machine-readable medium of claim 11, further providing instructions that, when executed by the processor, cause the processor to perform operations comprising acquiring stereo data based on an input image sequence of changing facial expressions. 13. The machine-readable medium of claim 11, further providing instructions that, when executed by the processor, cause the processor to perform operations comprising processing the acquired stereo data using principal component analysis (PCA). 14. The machine-readable medium of claim 11, further providing instructions that, when executed by the processor, cause the processor to perform operations comprising approximating a general shape as a linear combination of shape basis vectors based on the stereo data processed using the PCA. 15. The machine-readable medium of claim 11, further providing instructions that, when executed by the processor, cause the processor to perform operations comprising tracking a monocular image sequence of facial deformations using the constructed 3D model.
Family patents:
Publication number | Publication date
AU2002303611A1 | 2002-11-18
US9400921B2 | 2016-07-26
HK1062067A1 | 2005-08-26
CN1509456A | 2004-06-30
GB2393065B | 2005-04-20
GB2393065A | 2004-03-17
GB0328400D0 | 2004-01-14
CN1294541C | 2007-01-10
WO2002091305A3 | 2003-09-18
US20030012408A1 | 2003-01-16
KR100571115B1 | 2006-04-13
WO2002091305A2 | 2002-11-14
Legal status:
2001-05-09 | Priority to US 09/852,398
2002-05-02 | Application filed by Intel Corporation
2002-05-02 | Priority to PCT/US2002/014014
2004-04-28 | Publication of KR20040034606A
2006-04-13 | Application granted
2006-04-13 | Publication of KR100571115B1
Priority:
Application number | Filing date | Publication | Title
US 09/852,398 | 2001-05-09 | US 9,400,921 B2 (2016-07-26) | Method and system using a data-driven model for monocular face tracking
PCT/US2002/014014 | 2002-05-02 | WO 02/091305 A2 (2002-11-14) | Method and system, using a data-driven model for monocular face tracking